Part I - Prosper Loan Data Analysis

by Doron Dusheiko

Introduction

A dataset of 113,937 loans is provided by Prosper. Each loan has 81 variables including loan amount, borrower rate/interest rate, loan status, borrower income and many others.

The goal of this analysis is to answer the following question: What affects the borrower's interest rate?

Preliminary Wrangling

Given the large number of variables, I'll explore them by data type.

Before working with the data, I'd like to make a few changes so that it's easier to work with:

What is the structure of your dataset?

There are 113,937 loans in the dataset with 81 features. Most variables are numeric; however, CreditGrade, ProsperRating (Alpha) and IncomeRange are ordinal, while LoanStatus, Occupation and EmploymentStatus are nominal categorical variables. CreditGrade is the rating of the customer pre-July 2009, and ProsperRating (Alpha) (along with ProsperRating (numeric) and ProsperScore) is the rating post-July 2009.

What is/are the main feature(s) of interest in your dataset?

I'm primarily interested in understanding which variables affect the interest rate offered to the borrower.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are a lot of possibilities with this dataset; however, I will focus on the credit grade/Prosper rating, occupation, employment status, income range, number of credit lines over the last 7 years, number of inquiries over the last 6 months, delinquencies over the last 7 years, public records over the last 12 months, debt-to-income ratio, total number of Prosper loans, monthly loan payment, listing category and the original loan amount requested. I feel these variables are likely to be the primary factors in deciding on an interest rate offered to the customer.

At a high level, I'm putting down some ideas for later explorations, across univariate, bivariate and multivariate analysis:

Univariate Exploration

First, let's investigate our primary variable of interest, the interest rate.

The distribution is slightly right-skewed, with the majority of interest rates falling between 0.1 and 0.2 (that is, 10% to 20%). Let's investigate further.
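One way to investigate further is to re-bin the rate at a finer, fixed width. The following is a minimal sketch on synthetic data; the 0.01 bin width and the generated values are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the BorrowerRate column (decimal fractions, not percent).
rng = np.random.default_rng(0)
rates = pd.Series(rng.uniform(0.05, 0.36, size=1000), name="BorrowerRate")

# Fixed-width bin edges; 0.01 is an assumed width chosen for illustration.
bin_width = 0.01
edges = np.arange(0, rates.max() + bin_width, bin_width)

# Count how many loans fall into each bin.
counts, edges = np.histogram(rates, bins=edges)

# The bin holding the most loans, reported by its left edge.
busiest_left_edge = edges[counts.argmax()]
```

The same `edges` array can be passed as the `bins` argument of a plotting call so that the chart and the counts agree.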

There is a large spike at 0.3177. It would be interesting to see later what is common amongst these specific applications.

All these applications have the same interest rate, yet they have a reasonably wide range of Prosper scores assigned to them, with 4 and 5 being the most common. This means that Prosper score is not the sole factor dictating the interest rate offered.

Let's get a high-level glimpse of the distribution of our variables before drilling into the details.

Under the old scoring system, we can see that scores C and D were most common, while C and B are most common under the new system. It's hard to read some of the other graphs as there is too much information, so let's investigate them individually.
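The check described above can be sketched as follows. The frame is a tiny made-up sample; only the column names `BorrowerRate` and `ProsperScore` come from the dataset, and the values are assumptions for illustration.

```python
import pandas as pd

# Tiny synthetic frame; values are invented for the example.
loans = pd.DataFrame({
    "BorrowerRate": [0.3177, 0.3177, 0.3177, 0.15, 0.12, 0.3177, 0.20],
    "ProsperScore": [4, 5, 8, 7, 9, 4, 6],
})

# The most frequent exact rate marks the spike.
spike_rate = loans["BorrowerRate"].value_counts().idxmax()

# Distribution of Prosper scores among the loans sitting at that rate.
spike_scores = (
    loans.loc[loans["BorrowerRate"] == spike_rate, "ProsperScore"]
    .value_counts()
    .sort_index()
)
```

If a single score dominated `spike_scores`, the score alone might explain the spike; a spread across several scores suggests other factors are involved.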

Now we can see that most borrowers are Employed (and likely Full-time employed), while the fewest are Not employed or Retired. The most common income ranges are \$25K - \$50K and \$50K - \$75K, and CA is the most common state. Other and Professional are the most common occupations, but these are likely just "catch-all" categories, so this doesn't give us much information. Interestingly though, Computer Programmers are the most common "non-generic" listed profession. Most types of students form the bottom 10 listed professions, as one might expect.

Let's have a quick review of the numerical and boolean variables in order to check for outliers.

Initial insights:

Many of the variables above seem to have outliers, based on their heavily right-skewed distributions. Let's see if we can study them.

Let's isolate the outliers and study them.

It's not clear that there is anything wrong with these loans; however, it will be good to view the effect on the distribution once they are removed.

Now it's a little clearer that:
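The text does not state which rule was used to isolate the outliers. As a hedged sketch, one common choice is the 1.5 × IQR upper fence, shown here on made-up `DebtToIncomeRatio` values; both the rule and the data are assumptions.

```python
import pandas as pd

# Synthetic values with one extreme entry; not taken from the dataset.
loans = pd.DataFrame({
    "DebtToIncomeRatio": [0.10, 0.22, 0.18, 0.35, 0.27, 10.01, 0.15, 0.30],
})

# Assumed rule: flag values above Q3 + 1.5 * IQR.
q1 = loans["DebtToIncomeRatio"].quantile(0.25)
q3 = loans["DebtToIncomeRatio"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

is_outlier = loans["DebtToIncomeRatio"] > upper_fence
outliers = loans[is_outlier]   # rows to inspect
trimmed = loans[~is_outlier]   # rows to re-plot without the extremes
```

Keeping `outliers` and `trimmed` as separate frames makes it easy to inspect the flagged loans while comparing the distribution with and without them.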

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The distribution is slightly right-skewed, with the majority of interest rates falling between 0.1 and 0.2 (that is, 10% to 20%). There is a large spike at 0.3177. All these applications have the same interest rate, yet they have a reasonably wide range of Prosper scores assigned to them, with 4 and 5 being the most common. This means that Prosper score is not the sole factor dictating the interest rate offered. The variable wasn't transformed; however, a more suitable bin size was chosen to get a more detailed view of it.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

CreditGrade is used prior to July 2009, while ProsperRating (alpha and numeric)/ProsperScore are used from July 2009. A new IsNewRating yes/no flag was added that allows me to easily filter one or the other. I thought about melting the different rating columns; however, since their score values might differ between the two time periods, I don't feel this is a good idea. I decided to convert Occupation and EmploymentStatus to nominal categorical variables, and IncomeRange, CreditGrade and ProsperRatingAlpha to ordinal categorical variables. Although ListingCategory (numeric) is a numeric field, I feel it's well suited to being a nominal categorical variable instead, so it was converted as well. A few columns were also renamed to make them easier to work with.
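The conversions described above could look roughly like this. It is a sketch on a four-row made-up frame: the rating order (worst to best) is an assumption, and the flag is stored as a boolean rather than a yes/no string for brevity.

```python
import pandas as pd

# Made-up rows: old-system loans have CreditGrade, new-system loans have ProsperRatingAlpha.
loans = pd.DataFrame({
    "CreditGrade": ["C", None, "AA", None],
    "ProsperRatingAlpha": [None, "B", None, "HR"],
    "Occupation": ["Other", "Professional", "Teacher", "Other"],
})

# Assumed ordering of the rating levels, from worst to best.
rating_order = ["HR", "E", "D", "C", "B", "A", "AA"]
ordinal = pd.CategoricalDtype(categories=rating_order, ordered=True)
for col in ["CreditGrade", "ProsperRatingAlpha"]:
    loans[col] = loans[col].astype(ordinal)

# Nominal variable: categories without any ordering.
loans["Occupation"] = loans["Occupation"].astype("category")

# Flag loans rated under the post-July 2009 system.
loans["IsNewRating"] = loans["ProsperRatingAlpha"].notna()
```

With an ordered dtype, plots and comparisons such as `loans["ProsperRatingAlpha"] >= "C"` follow the rating order instead of alphabetical order.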

In terms of the distribution study, we determined:

Bivariate Exploration

Let's have a look at pair-wise correlations between our numerical variables.

We see a positive correlation between monthly loan payment and original loan amount, as one would expect, and a negative correlation between borrower rate and Prosper score, i.e. the worse the score, the higher the rate. This doesn't give us much information though, as the Prosper score is likely a function of all the other variables. As such, it's an expected correlation which doesn't describe why the borrower rate is what it is.

Credit score has a moderately strong negative correlation with borrower rate, i.e. the lower the credit score, the higher the borrower rate. Many of the other numeric variables have a weak linear correlation with borrower rate, but it's possible that their relationship with borrower rate is non-linear.

Let's try to bring some of these non-linearities out now.

It seems like there is a positive correlation between the credit score and the loan amount offered, but otherwise I don't feel there is much to be gained from this plot. Let's take a look at the relationship between borrower rate and some of the categorical variables.

Let's have a look at the relationship between some of the categorical variables.
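To illustrate why a weak linear correlation does not rule out a relationship, the sketch below compares Pearson correlation with a rank-based (Spearman) correlation on synthetic data. The column name `CreditScore` and the values are placeholders, not the dataset's actual fields or figures.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly monotonic but non-linear relationship.
x = np.arange(1, 51, dtype=float)
df = pd.DataFrame({
    "CreditScore": x,          # placeholder values, not real credit scores
    "BorrowerRate": 1.0 / x,   # falls quickly at first, then flattens
})

# Pearson measures linear association only.
pearson = df.corr(method="pearson").loc["CreditScore", "BorrowerRate"]

# Spearman is Pearson applied to ranks, so it captures any monotonic trend.
spearman = df.rank().corr(method="pearson").loc["CreditScore", "BorrowerRate"]
```

Here the rank-based coefficient reaches -1 while the Pearson coefficient is noticeably weaker, which is the pattern to look for when a relationship is monotonic but curved.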

This agrees with what we saw earlier: the highest earners tend to be full-time employees, D is the most common ProsperRating for those in the 25K - 50K range, and C is the most common in the 50K - 75K range.

Let's dive into the relationship between some of our other variables.

Interestingly, home owners tend to have a lower debt-to-income ratio while having more credit lines over the last 7 years. Likewise, being a home owner tends to attract a lower borrower rate.

Let's delve further into the relationship between borrower rate, home ownership and debt-to-income ratio.

The distribution of DebtToIncome is quite similar whether or not the borrower is a home owner, while being a home owner does attract a slightly better borrower rate, as we already observed above.

Let's see if we can spot a relationship between employment status duration and borrower rate.

There doesn't seem to be much insight to be gained here.
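A comparison like this can be summarised numerically with a grouped median. The sketch below uses invented values; the column names mirror the dataset's fields but should be treated as assumptions, since some columns were renamed in the notebook.

```python
import pandas as pd

# Made-up loans; values chosen only to illustrate the grouping.
loans = pd.DataFrame({
    "IsBorrowerHomeowner": [True, True, False, False, True, False],
    "BorrowerRate": [0.12, 0.15, 0.22, 0.25, 0.18, 0.19],
    "DebtToIncomeRatio": [0.20, 0.25, 0.30, 0.35, 0.15, 0.40],
})

# Median of each numeric variable per home-ownership group (False first, then True).
summary = (
    loans.groupby("IsBorrowerHomeowner")[["BorrowerRate", "DebtToIncomeRatio"]]
    .median()
)
```

The median is used rather than the mean because these variables are right-skewed, so a few extreme loans would otherwise dominate the comparison.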

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration

I'd like to explore the relationship between Borrower Rate, Home Ownership, Credit Rating and Income Range.

It's hard to make any inferences from this, since the upper end of the credit range seems to be quite similar across the board.

We know that the Credit Grade/Prosper Score correlates with the borrower rate, but it's not clear what contributes to that credit grade or Prosper score, so let's unpack that a little.

We can see from this that ProsperRatingAlpha is the clearest indicator of the borrower rate, so we need to fully understand what leads to this Prosper rating. We know from earlier that Employment Status and Income Range are important factors, so let's explore their relationship to Prosper Rating.

For each Prosper rating, we see almost equal contributions from all the income ranges.

For each Prosper rating, we see almost equal contributions whether or not the person is a home owner.

Let's see how credit score varies across income range and Prosper rating.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It is clear that the Prosper Rating given to a loan is the determining factor for the borrower rate; however, I wasn't able to clearly see strong correlations with what led to the specific Prosper Rating. Credit rating, home ownership, income range and employment status all contribute somewhat, but a deeper analysis is required to further understand these and other relationships not investigated.

Were there any interesting or surprising interactions between features?

I found it interesting that although income range does contribute to the Prosper rating and borrower rate, you still see borrower rates from low to high across a mix of income ranges, indicating that there are other factors at play.

Conclusions

From our analysis we can determine the following: